Table of Contents

House Prices Prediction

Data Overview:

Import Libraries:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
pd.options.display.max_columns = 100

import seaborn as sns
from util import *

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Read Data:

In [2]:
!ls
House Prices Predictions.ipynb	__pycache__  util.py
In [3]:
train      = pd.read_csv('../0_Data/train.csv')
test       = pd.read_csv('../0_Data/test.csv')

for data in [train, test]:
    print(f'~> [{var2str(data).ljust(5)}] has {bg(data.shape[0])} rows, and {bg(data.shape[1])} columns.')
~> [train] has  1,460  rows, and  81  columns.
~> [test ] has  1,459  rows, and  80  columns.

Data Overview:

In [4]:
train.head()
Out[4]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2003 2003 Gable CompShg VinylSd VinylSd BrkFace 196.0 Gd TA PConc Gd TA No GLQ 706 Unf 0 150 856 GasA Ex Y SBrkr 856 854 0 1710 1 0 2 1 3 1 Gd 8 Typ 0 NaN Attchd 2003.0 RFn 2 548 TA TA Y 0 61 0 0 0 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub FR2 Gtl Veenker Feedr Norm 1Fam 1Story 6 8 1976 1976 Gable CompShg MetalSd MetalSd None 0.0 TA TA CBlock Gd TA Gd ALQ 978 Unf 0 284 1262 GasA Ex Y SBrkr 1262 0 0 1262 0 1 2 0 3 1 TA 6 Typ 1 TA Attchd 1976.0 RFn 2 460 TA TA Y 298 0 0 0 0 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub Inside Gtl CollgCr Norm Norm 1Fam 2Story 7 5 2001 2002 Gable CompShg VinylSd VinylSd BrkFace 162.0 Gd TA PConc Gd TA Mn GLQ 486 Unf 0 434 920 GasA Ex Y SBrkr 920 866 0 1786 1 0 2 1 3 1 Gd 6 Typ 1 TA Attchd 2001.0 RFn 2 608 TA TA Y 0 42 0 0 0 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub Corner Gtl Crawfor Norm Norm 1Fam 2Story 7 5 1915 1970 Gable CompShg Wd Sdng Wd Shng None 0.0 TA TA BrkTil TA Gd No ALQ 216 Unf 0 540 756 GasA Gd Y SBrkr 961 756 0 1717 1 0 1 0 3 1 Gd 7 Typ 1 Gd Detchd 1998.0 Unf 3 642 TA TA Y 0 35 272 0 0 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub FR2 Gtl NoRidge Norm Norm 1Fam 2Story 8 5 2000 2000 Gable CompShg VinylSd VinylSd BrkFace 350.0 Gd TA PConc Gd TA Av GLQ 655 Unf 0 490 1145 GasA Ex Y SBrkr 1145 1053 0 2198 1 0 2 1 4 1 Gd 9 Typ 1 TA Attchd 2000.0 RFn 3 836 TA TA Y 192 84 0 0 0 0 NaN NaN NaN 0 12 2008 WD Normal 250000
In [5]:
test.head()
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No Rec 468.0 LwQ 144.0 270.0 882.0 GasA TA Y SBrkr 896 0 0 896 0.0 0.0 1 0 2 1 TA 5 Typ 0 NaN Attchd 1961.0 Unf 1.0 730.0 TA TA Y 140 0 0 0 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108.0 TA TA CBlock TA TA No ALQ 923.0 Unf 0.0 406.0 1329.0 GasA TA Y SBrkr 1329 0 0 1329 0.0 0.0 1 1 3 1 Gd 6 Typ 0 NaN Attchd 1958.0 Unf 1.0 312.0 TA TA Y 393 36 0 0 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No GLQ 791.0 Unf 0.0 137.0 928.0 GasA Gd Y SBrkr 928 701 0 1629 0.0 0.0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997.0 Fin 2.0 482.0 TA TA Y 212 34 0 0 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 6 1998 1998 Gable CompShg VinylSd VinylSd BrkFace 20.0 TA TA PConc TA TA No GLQ 602.0 Unf 0.0 324.0 926.0 GasA Ex Y SBrkr 926 678 0 1604 0.0 0.0 2 1 3 1 Gd 7 Typ 1 Gd Attchd 1998.0 Fin 2.0 470.0 TA TA Y 360 36 0 0 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1992 1992 Gable CompShg HdBoard HdBoard None 0.0 Gd TA PConc Gd TA No ALQ 263.0 Unf 0.0 1017.0 1280.0 GasA Ex Y SBrkr 1280 0 0 1280 0.0 0.0 2 0 2 1 Gd 5 Typ 0 NaN Attchd 1992.0 RFn 2.0 506.0 TA TA Y 0 82 0 0 144 0 NaN NaN NaN 0 1 2010 WD Normal

My assumption here, that the organizers gathered all the data together, then they took the first 1,460 rows to be the training data, and the rest to be the testing data, in other words, the data is not shuffled.

Now, let's take an overview of the training data:

In [6]:
summary(train, 'Missing')
~> Dataframe has  1,460  Rows, and  81  Columns.
~> Dataframe has  19  Columns have [Missing] Values.
------------------------------------------------------------
~> There are  35 	 Columns that have [Type] =  int64 
~> There are  3 	 Columns that have [Type] =  float64 
~> There are  43 	 Columns that have [Type] =  object 
Out[6]:
Name dtypes Missing M_Percent Uniques First Value Second Value Third Value Entropy
72 PoolQC object 1453 99.52 3 nan nan nan 1.56
74 MiscFeature object 1406 96.3 4 nan nan nan 0.59
6 Alley object 1369 93.77 2 nan nan nan 0.99
73 Fence object 1179 80.75 4 nan nan nan 1.58
57 FireplaceQu object 690 47.26 5 nan TA TA 1.52
3 LotFrontage float64 259 17.74 110 65 80 68 5.77
59 GarageYrBlt float64 81 5.55 97 2003 1976 2001 6.1
64 GarageCond object 81 5.55 5 TA TA TA 0.29
58 GarageType object 81 5.55 6 Attchd Attchd Attchd 1.35
60 GarageFinish object 81 5.55 3 RFn RFn RFn 1.55
63 GarageQual object 81 5.55 5 TA TA TA 0.34
35 BsmtFinType2 object 38 2.6 6 Unf Unf Unf 0.77
32 BsmtExposure object 38 2.6 4 No Gd Mn 1.42
30 BsmtQual object 37 2.53 4 Gd Gd Gd 1.47
31 BsmtCond object 37 2.53 4 TA TA TA 0.48
33 BsmtFinType1 object 37 2.53 6 GLQ ALQ GLQ 2.34
26 MasVnrArea float64 8 0.55 327 196 0 162 4.26
25 MasVnrType object 8 0.55 4 BrkFace None BrkFace 1.35
42 Electrical object 1 0.07 5 SBrkr SBrkr SBrkr 0.51
0 Id int64 0 0 1460 1 2 3 10.51
55 Functional object 0 0 7 Typ Typ Typ 0.51
56 Fireplaces int64 0 0 4 0 1 1 1.35
53 KitchenQual object 0 0 4 Gd TA Gd 1.43
52 KitchenAbvGr int64 0 0 4 1 1 1 0.29
51 BedroomAbvGr int64 0 0 8 3 3 3 1.71
50 HalfBath int64 0 0 3 1 0 1 1.01
49 FullBath int64 0 0 4 2 2 2 1.18
48 BsmtHalfBath int64 0 0 3 0 1 0 0.32
54 TotRmsAbvGrd int64 0 0 12 8 6 6 2.69
61 GarageCars int64 0 0 5 2 2 2 1.6
46 GrLivArea int64 0 0 861 1710 1262 1786 9.43
62 GarageArea int64 0 0 441 548 460 608 7.75
65 PavedDrive object 0 0 3 Y Y Y 0.48
66 WoodDeckSF int64 0 0 274 0 298 0 4.44
67 OpenPorchSF int64 0 0 202 61 0 42 4.83
68 EnclosedPorch int64 0 0 120 0 0 0 1.52
69 3SsnPorch int64 0 0 20 0 0 0 0.19
70 ScreenPorch int64 0 0 76 0 0 0 0.87
71 PoolArea int64 0 0 8 0 0 0 0.06
75 MiscVal int64 0 0 21 0 0 0 0.36
76 MoSold int64 0 0 12 2 5 9 3.37
77 YrSold int64 0 0 5 2008 2007 2008 2.29
78 SaleType object 0 0 9 WD WD WD 0.78
79 SaleCondition object 0 0 6 Normal Normal Normal 0.97
47 BsmtFullBath int64 0 0 4 1 0 1 1.06
40 HeatingQC object 0 0 5 Ex Ex Ex 1.62
45 LowQualFinSF int64 0 0 24 0 0 0 0.21
11 LandSlope object 0 0 3 Gtl Gtl Gtl 0.34
17 OverallQual int64 0 0 10 7 6 7 2.49
16 HouseStyle object 0 0 8 2Story 1Story 2Story 1.86
15 BldgType object 0 0 5 1Fam 1Fam 1Fam 0.94
14 Condition2 object 0 0 8 Norm Norm Norm 0.11
13 Condition1 object 0 0 9 Norm Feedr Norm 0.9
12 Neighborhood object 0 0 25 CollgCr Veenker CollgCr 4.19
10 LotConfig object 0 0 5 Inside FR2 Inside 1.22
19 YearBuilt int64 0 0 112 2003 1976 2001 6.27
9 Utilities object 0 0 2 AllPub AllPub AllPub 0.01
8 LandContour object 0 0 4 Lvl Lvl Lvl 0.63
7 LotShape object 0 0 4 Reg Reg IR1 1.14
5 Street object 0 0 2 Pave Pave Pave 0.04
4 LotArea int64 0 0 1073 8450 9600 11250 9.7
2 MSZoning object 0 0 5 RL RL RL 1
18 OverallCond int64 0 0 9 5 8 5 1.93
20 YearRemodAdd int64 0 0 61 2003 1976 2002 5.34
44 2ndFlrSF int64 0 0 417 854 0 866 4.63
36 BsmtFinSF2 int64 0 0 144 0 0 0 1.32
43 1stFlrSF int64 0 0 753 856 1262 920 9.2
41 CentralAir object 0 0 2 Y Y Y 0.35
1 MSSubClass int64 0 0 15 60 20 60 2.92
39 Heating object 0 0 6 GasA GasA GasA 0.19
38 TotalBsmtSF int64 0 0 721 856 1262 920 8.98
37 BsmtUnfSF int64 0 0 780 150 284 434 9.01
34 BsmtFinSF1 int64 0 0 637 706 978 486 7.09
21 RoofStyle object 0 0 6 Gable Gable Gable 0.9
29 Foundation object 0 0 6 PConc CBlock PConc 1.52
28 ExterCond object 0 0 5 TA TA TA 0.63
27 ExterQual object 0 0 4 Gd TA Gd 1.19
24 Exterior2nd object 0 0 16 VinylSd MetalSd VinylSd 2.81
23 Exterior1st object 0 0 15 VinylSd MetalSd VinylSd 2.73
22 RoofMatl object 0 0 8 CompShg CompShg CompShg 0.17
80 SalePrice int64 0 0 663 208500 181500 223500 8.85

Wow! This summary table tells us a lot, here's the information we got:

  • We have 19 columns that have missing values.
  • There a bunch of columns that have the same number of missing values such as 81 for 5 columns, 37 for 3 columns, and 38 for 2 columns. We'll investigate more on these columns.
  • Most of the columns that have a huge number of missing values, are object type.
  • Initially, we have 43 categorical columns.
  • We have 40 numeric columns.

Now, let's see the test data:

In [7]:
summary(test, 'Missing')
~> Dataframe has  1,459  Rows, and  80  Columns.
~> Dataframe has  33  Columns have [Missing] Values.
------------------------------------------------------------
~> There are  26 	 Columns that have [Type] =  int64 
~> There are  11 	 Columns that have [Type] =  float64 
~> There are  43 	 Columns that have [Type] =  object 
Out[7]:
Name dtypes Missing M_Percent Uniques First Value Second Value Third Value Entropy
72 PoolQC object 1456 99.79 2 nan nan nan 0.92
74 MiscFeature object 1408 96.5 3 nan Gar2 nan 0.56
6 Alley object 1352 92.67 2 nan nan nan 0.93
73 Fence object 1169 80.12 4 MnPrv nan MnPrv 1.41
57 FireplaceQu object 730 50.03 5 nan nan TA 1.57
3 LotFrontage float64 227 15.56 115 80 81 74 5.81
59 GarageYrBlt float64 78 5.35 97 1961 1958 1997 6.08
63 GarageQual object 78 5.35 4 TA TA TA 0.38
60 GarageFinish object 78 5.35 3 Unf Unf Fin 1.54
64 GarageCond object 78 5.35 5 TA TA TA 0.28
58 GarageType object 76 5.21 6 Attchd Attchd Attchd 1.41
31 BsmtCond object 45 3.08 4 TA TA TA 0.51
30 BsmtQual object 44 3.02 4 TA TA Gd 1.55
32 BsmtExposure object 44 3.02 4 No No No 1.42
33 BsmtFinType1 object 42 2.88 6 Rec ALQ GLQ 2.34
35 BsmtFinType2 object 42 2.88 6 LwQ Unf Unf 0.84
25 MasVnrType object 16 1.1 4 None BrkFace None 1.31
26 MasVnrArea float64 15 1.03 303 0 108 0 4.07
2 MSZoning object 4 0.27 5 RH RL RL 1.06
48 BsmtHalfBath float64 2 0.14 3 0 0 0 0.35
9 Utilities object 2 0.14 1 AllPub AllPub AllPub 0
55 Functional object 2 0.14 7 Typ Typ Typ 0.5
47 BsmtFullBath float64 2 0.14 4 0 0 0 1.08
34 BsmtFinSF1 float64 1 0.07 669 468 923 791 7.17
36 BsmtFinSF2 float64 1 0.07 161 144 0 0 1.44
37 BsmtUnfSF float64 1 0.07 793 270 406 137 8.99
53 KitchenQual object 1 0.07 4 TA Gd TA 1.41
38 TotalBsmtSF float64 1 0.07 736 882 1329 928 8.99
24 Exterior2nd object 1 0.07 15 VinylSd Wd Sdng VinylSd 2.77
61 GarageCars float64 1 0.07 6 1 1 2 1.67
23 Exterior1st object 1 0.07 13 VinylSd Wd Sdng VinylSd 2.71
62 GarageArea float64 1 0.07 459 730 312 482 7.8
78 SaleType object 1 0.07 9 WD WD WD 0.83
75 MiscVal int64 0 0 26 0 12500 0 0.37
51 BedroomAbvGr int64 0 0 7 2 3 3 1.72
52 KitchenAbvGr int64 0 0 3 1 1 1 0.27
77 YrSold int64 0 0 5 2010 2010 2010 2.28
54 TotRmsAbvGrd int64 0 0 12 5 6 6 2.57
76 MoSold int64 0 0 12 6 6 3 3.41
56 Fireplaces int64 0 0 5 0 0 1 1.34
71 PoolArea int64 0 0 7 0 0 0 0.05
50 HalfBath int64 0 0 3 0 1 1 1.01
70 ScreenPorch int64 0 0 75 120 0 0 1.02
69 3SsnPorch int64 0 0 13 0 0 0 0.11
68 EnclosedPorch int64 0 0 131 0 0 0 1.81
67 OpenPorchSF int64 0 0 203 0 36 34 4.88
66 WoodDeckSF int64 0 0 263 140 393 212 4.45
65 PavedDrive object 0 0 3 Y Y Y 0.57
0 Id int64 0 0 1459 1461 1462 1463 10.51
40 HeatingQC object 0 0 5 TA TA Gd 1.6
49 FullBath int64 0 0 5 1 1 2 1.17
46 GrLivArea int64 0 0 879 896 1329 1629 9.48
4 LotArea int64 0 0 1106 11622 14267 13830 9.79
5 Street object 0 0 2 Pave Pave Pave 0.04
7 LotShape object 0 0 4 Reg IR1 IR1 1.1
8 LandContour object 0 0 4 Lvl Lvl Lvl 0.62
10 LotConfig object 0 0 5 Inside Corner Inside 1.17
11 LandSlope object 0 0 3 Gtl Gtl Gtl 0.27
12 Neighborhood object 0 0 25 NAmes NAmes Gilbert 4.22
13 Condition1 object 0 0 9 Feedr Norm Norm 0.94
14 Condition2 object 0 0 5 Norm Norm Norm 0.1
15 BldgType object 0 0 5 1Fam 1Fam 1Fam 0.99
16 HouseStyle object 0 0 7 1Story 1Story 2Story 1.81
17 OverallQual int64 0 0 10 5 6 5 2.53
18 OverallCond int64 0 0 9 6 6 5 1.91
19 YearBuilt int64 0 0 106 1961 1958 1997 6.19
20 YearRemodAdd int64 0 0 61 1961 1958 1998 5.35
21 RoofStyle object 0 0 6 Gable Hip Gable 0.83
22 RoofMatl object 0 0 4 CompShg CompShg CompShg 0.1
27 ExterQual object 0 0 4 TA TA TA 1.23
28 ExterCond object 0 0 5 TA TA TA 0.73
29 Foundation object 0 0 6 CBlock CBlock PConc 1.54
39 Heating object 0 0 4 GasA GasA GasA 0.08
1 MSSubClass int64 0 0 16 20 20 60 2.91
41 CentralAir object 0 0 2 Y Y Y 0.36
42 Electrical object 0 0 4 SBrkr SBrkr SBrkr 0.49
43 1stFlrSF int64 0 0 789 896 1329 928 9.29
44 2ndFlrSF int64 0 0 407 0 0 701 4.55
45 LowQualFinSF int64 0 0 15 0 0 0 0.11
79 SaleCondition object 0 0 6 Normal Normal Normal 0.97

We can see different things here in test data:

  • There are 33 columns have missing values, which is larger than the training data.
  • The of int and float columns are different than the training data.
  • The number of categorical columns are the same as in the training data.
  • Most of the columns that have missing values have the type object.

Descriptive Statistics:

Let's see some descriptive statistics for numeric features and non-numeric features, to make the data more intuitive to use.

But first, let's add the training data and test data together.

In [8]:
test.head()
Out[8]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No Rec 468.0 LwQ 144.0 270.0 882.0 GasA TA Y SBrkr 896 0 0 896 0.0 0.0 1 0 2 1 TA 5 Typ 0 NaN Attchd 1961.0 Unf 1.0 730.0 TA TA Y 140 0 0 0 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108.0 TA TA CBlock TA TA No ALQ 923.0 Unf 0.0 406.0 1329.0 GasA TA Y SBrkr 1329 0 0 1329 0.0 0.0 1 1 3 1 Gd 6 Typ 0 NaN Attchd 1958.0 Unf 1.0 312.0 TA TA Y 393 36 0 0 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No GLQ 791.0 Unf 0.0 137.0 928.0 GasA Gd Y SBrkr 928 701 0 1629 0.0 0.0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997.0 Fin 2.0 482.0 TA TA Y 212 34 0 0 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 6 6 1998 1998 Gable CompShg VinylSd VinylSd BrkFace 20.0 TA TA PConc TA TA No GLQ 602.0 Unf 0.0 324.0 926.0 GasA Ex Y SBrkr 926 678 0 1604 0.0 0.0 2 1 3 1 Gd 7 Typ 1 Gd Attchd 1998.0 Fin 2.0 470.0 TA TA Y 360 36 0 0 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub Inside Gtl StoneBr Norm Norm TwnhsE 1Story 8 5 1992 1992 Gable CompShg HdBoard HdBoard None 0.0 Gd TA PConc Gd TA No ALQ 263.0 Unf 0.0 1017.0 1280.0 GasA Ex Y SBrkr 1280 0 0 1280 0.0 0.0 2 0 2 1 Gd 5 Typ 0 NaN Attchd 1992.0 RFn 2.0 506.0 TA TA Y 0 82 0 0 144 0 NaN NaN NaN 0 1 2010 WD Normal
In [9]:
train_ids    = train.Id
train_target = train.SalePrice
test_ids     = test.Id
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)

traintest = pd.concat([train, test], axis=0)
traintest.shape
Out[9]:
(2919, 80)
In [10]:
traintest.describe(exclude='number').T
Out[10]:
count unique top freq
Alley 198 2 Grvl 120
BldgType 2919 5 1Fam 2425
BsmtCond 2837 4 TA 2606
BsmtExposure 2837 4 No 1904
BsmtFinType1 2840 6 Unf 851
BsmtFinType2 2839 6 Unf 2493
BsmtQual 2838 4 TA 1283
CentralAir 2919 2 Y 2723
Condition1 2919 9 Norm 2511
Condition2 2919 8 Norm 2889
Electrical 2918 5 SBrkr 2671
ExterCond 2919 5 TA 2538
ExterQual 2919 4 TA 1798
Exterior1st 2918 15 VinylSd 1025
Exterior2nd 2918 16 VinylSd 1014
Fence 571 4 MnPrv 329
FireplaceQu 1499 5 Gd 744
Foundation 2919 6 PConc 1308
Functional 2917 7 Typ 2717
GarageCond 2760 5 TA 2654
GarageFinish 2760 3 Unf 1230
GarageQual 2760 5 TA 2604
GarageType 2762 6 Attchd 1723
Heating 2919 6 GasA 2874
HeatingQC 2919 5 Ex 1493
HouseStyle 2919 8 1Story 1471
KitchenQual 2918 4 TA 1492
LandContour 2919 4 Lvl 2622
LandSlope 2919 3 Gtl 2778
LotConfig 2919 5 Inside 2133
LotShape 2919 4 Reg 1859
MSZoning 2915 5 RL 2265
MasVnrType 2895 4 None 1742
MiscFeature 105 4 Shed 95
Neighborhood 2919 25 NAmes 443
PavedDrive 2919 3 Y 2641
PoolQC 10 3 Gd 4
RoofMatl 2919 8 CompShg 2876
RoofStyle 2919 6 Gable 2310
SaleCondition 2919 6 Normal 2402
SaleType 2918 9 WD 2525
Street 2919 2 Pave 2907
Utilities 2917 2 AllPub 2916

Checking NaN values:

First let's see the shape of the NaN values in the dataset.

In [11]:
plt.figure(figsize=(20, 14))
sns.heatmap(traintest.isnull(), cbar=False);

We can see some patterns here for NaN values, and we can leverage that for feature engineering, for example:

  • Creating a column contains a number of missing values for each row.
  • Creating a column for each column that has missing values and make the new column has 2 values, True for the row of that column has missing value, and False if that row has a value.

    The meaning of NaN values in the columns:
  • Alley: Type of alley access to property.
    • NaN here means that the house has not alley, hence we can set it to 0 after we encode this feature.
  • BsmtCond: Evaluates the general condition of the basement.
    • NaN here means there's no basement. and so on for each feature, we'll do data investigation.
In [12]:
# Checking NaN values for each Row.
## Remeber: Number of rows = 2,919
## Remeber: Number of cols = 81
traintest.isna().sum(axis=1).value_counts().head(15)
Out[12]:
5     1134
4      771
6      550
3      146
10      92
7       69
11      61
9       36
2       16
8       14
12      10
16       6
15       4
14       4
13       3
dtype: int64

We can see some pattern here, there are 1,134 rows that each row has 5 NaNs, and so on.

In [13]:
# Checking NaN values for each columns.
## Remeber: Number of rows = 2,919
## Remeber: Number of cols = 81
traintest.isna().sum(axis=0).value_counts().head(15)
Out[13]:
0       45
1       11
159      4
2        4
82       2
4        1
1420     1
23       1
24       1
157      1
2814     1
486      1
2348     1
1459     1
79       1
dtype: int64

Half of the columns have no missing values. And there some of them have a huge amount of missing values, and we saw that in the heatmap.

Let's see the columns that have nulls:

In [14]:
nan_cols = [col for col in train.columns if train[col].isnull().sum() > 0]
nan_cols
Out[14]:
['LotFrontage',
 'Alley',
 'MasVnrType',
 'MasVnrArea',
 'BsmtQual',
 'BsmtCond',
 'BsmtExposure',
 'BsmtFinType1',
 'BsmtFinType2',
 'Electrical',
 'FireplaceQu',
 'GarageType',
 'GarageYrBlt',
 'GarageFinish',
 'GarageQual',
 'GarageCond',
 'PoolQC',
 'Fence',
 'MiscFeature']

Checking Constant Columns:

In [15]:
traintest.apply(pd.Series.nunique, axis=0).sort_values(0).head()
Out[15]:
Utilities     2
Street        2
Alley         2
CentralAir    2
HalfBath      3
dtype: int64

Since the minimum number of unique values in the columns is 2, there's no column has constant value.

Checking Duplicated Columns:

In [16]:
len_dup = traintest.loc[:, traintest.columns.duplicated()].shape[1]
print(f'~> The number of duplicate columns = {bg(len_dup, color="red")}')
~> The number of duplicate columns =  0 

Data Investigation & EDA:

Big Picture:

In [17]:
to_visual = train[list(train.columns[1:20])+['SalePrice']].dropna()

grid = sns.PairGrid(data=to_visual, size=3, diag_sharey=False)
grid.map_lower(plt.scatter, alpha=.7)
grid.map_diag(plt.hist)
grid.map_upper(sns.kdeplot, cmap=plt.cm.OrRd_r)
plt.suptitle('Distrubtion', size=32, y=1.05);
In [18]:
to_visual = train[list(train.columns[20:40])+['SalePrice']].dropna()

sns.pairplot(to_visual);
In [19]:
to_visual = train[list(train.columns[40:50])+['SalePrice']].dropna()

sns.pairplot(to_visual);
In [20]:
to_visual = train[list(train.columns[50:65])+['SalePrice']].dropna()

sns.pairplot(to_visual);
In [23]:
def figsize(sz1=8, sz2=6):
    return plt.figure(figsize=(sz1, sz2))
In [24]:
figsize(16, 8)
to_visual = train[list(train.columns[1:])]
sns.heatmap(to_visual.corr());

Observations:

  • There are a lot of outliers in a lot of columns, we'll try to detect them.
  • There are some columns have high correlation with the target column.

SalePrice:

In [25]:
plt.rcParams['font.size'] = 14
plt.figure(figsize=(14, 6))
sns.distplot(train.SalePrice);
In [26]:
train.SalePrice.describe()
Out[26]:
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64

MSSubClass:

In [27]:
plt.figure(figsize=(14, 6))
color = sns.color_palette()[0]
order = train['MSSubClass'].value_counts().index
ax = sns.countplot(x='MSSubClass', 
              data=train, 
              color=color, 
              order=order,
              edgecolor='k')
show_annotation(ax, total=len(train));
In [28]:
plt.figure(figsize=(8, 6))
sns.regplot(data=train, x='MSSubClass', y='SalePrice', x_jitter=.7, y_jitter=.2,
           scatter_kws={'alpha': 1/5});

MSZoning:

In [29]:
def plot_cat_with_target(var, rotate=False):
    figsize(16, 6)
    plt.rcParams['font.size'] = 12

    target = 'SalePrice'

    plt.subplot(131)
    ax = train[var].value_counts().plot.bar(edgecolor='k')
    show_annotation(ax, 20, len(train))
    plt.title(f'Distrubtion of {var}', size=16, y=1.05)

    plt.subplot(132)
    sns.boxplot(data=train, x=var, y=target)
    plt.title(f'Relation b/w {var} & SalePrice', size=16, y=1.05);
    plt.xticks(rotation=(90 if rotate else 0))

    plt.subplot(133)
    sns.swarmplot(data=train, x=var, y=target, color='black', size=3)
    plt.title(f'Relation b/w {var} & SalePrice', size=16, y=1.05)
    plt.xticks(rotation=(90 if rotate else 0));

    plt.subplots_adjust(wspace=.4)
In [30]:
plot_cat_with_target('MSZoning')

LotFrontage:

In [31]:
figsize(16, 6)

plt.subplot(121)
sns.distplot(train['LotFrontage'].dropna(), kde=False);
plt.title('Historgram of LotFrontage', size=20, y=1.05)

plt.subplot(122)
sns.regplot(data=train, x='LotFrontage', y='SalePrice')
plt.title('Relation b/w LotFrontage & SalePrice', size=20, y=1.05);

# plt.subplots_adjust(wspace=.2);
NOTE:
We can see there are 2 points that are greater than 300 feet for the `LotFrontage`considered as outliers, because there's an increasing relation b/w `LotFrontage` and `SalePrice`.

Street:

In [32]:
plot_cat_with_target('Street')

Alley:

In [33]:
train.Alley.fillna('NaN', inplace=True)
test.Alley.fillna('NaN', inplace=True)
In [34]:
plot_cat_with_target('Alley')

There is a lot of houses that don't have alley to the house.

What I can infer is that, this feature does't tell us much about the price of the house.

LotShape:

In [35]:
plot_cat_with_target('LotShape')

LandContour:

                   Flatness of the property

In [36]:
plot_cat_with_target('LandContour')

Utilities:

                   Type of utilities available

In [37]:
plot_cat_with_target('Utilities')

Obervation:
We can see here, that Utilities features, have only one point that's different than AllPub, IMHO, we should remove this feature.

LotConfig:

                   Lot Configuration

In [38]:
plot_cat_with_target('LotConfig')

LandSlope:

                   Slope of property

In [39]:
plot_cat_with_target('LandSlope')

Neighborhood:

                   Physical locations within Ames city limits

In [40]:
figsize(30, 6)
plt.rcParams['font.size'] = 12

var    = 'Neighborhood'
target = 'SalePrice'

plt.subplot(131)
ax = train[var].value_counts().plot.bar(edgecolor='k')
# show_annotation(ax, 20, len(train))
plt.title(f'Distrubtion of {var}', size=16, y=1.05)

plt.subplot(132)
sns.boxplot(data=train, x=var, y=target)
plt.xticks(rotation=45)
# sns.swarmplot(data=train, x=var, y=target, color='black', size=3)
plt.title(f'Relation b/w {var} & SalePrice', size=16, y=1.05);

plt.subplots_adjust(wspace=.2)

Observation:
I can see the box plot follows a kind of sin wave patter, this might help in the feature engineering.

Condition1:

                   Proximity to various conditions

In [41]:
plot_cat_with_target('Condition1', rotate=True)

Condition2:

                   Proximity to various conditions (if more than one is present)

In [42]:
plot_cat_with_target('Condition2', rotate=True)

BldgType:

                   Type of dwelling

In [43]:
plot_cat_with_target('BldgType', rotate=True)

Observation:
You can see, there 2 points, that we can call them outliers, since they have a price larger enough than the normal.



Interesting Observation:
You can see from these plots and all the above plots, the same 2 points appears in the swarmp plot, this is quite interesting, we wouldn't get this observation, from watching a few plots.
Hence, we should remove them.

HouseStyle:

                   Style of dwelling

In [44]:
plot_cat_with_target('HouseStyle', rotate=True)

OverallQual:

                   Rates the overall material and finish of the house

In [45]:
plot_cat_with_target('OverallQual')

Observation:
In the box plot, we see an exponential raising.
When the material is very excellent, the price increases, that's very resonable.

OverallCond:

                   Rates the overall condition of the house

In [46]:
plot_cat_with_target('OverallCond')

Observation:
We see the condition of the house, doesn't increase well when the condition is good, unlike the condition of the material of the house.

YearBuilt & YearRemodAdd:

Since these 2 features are timedate, we can use them very well in the feature engineering.

In [47]:
figsize(16, 6);
sns.jointplot(data=train, x='YearBuilt', y='SalePrice', joint_kws={'alpha': .5});
<Figure size 1152x432 with 0 Axes>

We can see a linear relationship b/w YearBuilt and SalePrice.
And the pearson correlation is 0.5.

In [48]:
figsize(20, 6)
sns.jointplot(data=train, x='YearRemodAdd', y='SalePrice', joint_kws={'alpha': .5});
<Figure size 1440x432 with 0 Axes>

RoofStyle: Type of roof

In [49]:
plot_cat_with_target('RoofStyle', rotate=True)

Observation: Looks like the RoofStyle is not a good parameter to estimate the price of the house, since all the values of the RoofStyle are intertwined.

RoofMatl: Roof material

In [50]:
plot_cat_with_target('RoofMatl', rotate=True)

Observation: Looks like the RoofMatl tells us something about the sale price, we should consider that.

Exterior1st: Exterior covering on house

In [51]:
plot_cat_with_target('Exterior1st', rotate=True)

Exterior2nd: Exterior covering on house (if more than one material)

In [52]:
plot_cat_with_target('Exterior2nd', rotate=True)

MasVnrType: Masonry veneer type

In [53]:
train.MasVnrType.fillna('NaN', inplace=True)
test.MasVnrType.fillna('NaN', inplace=True)

plot_cat_with_target('MasVnrType')

MasVnrArea: Masonry veneer area in square feet

In [54]:
figsize(16, 6)

sns.jointplot(data=train, x='MasVnrArea', y='SalePrice', joint_kws={'alpha': .5})
Out[54]:
<seaborn.axisgrid.JointGrid at 0x7ff6bd14eb38>
<Figure size 1152x432 with 0 Axes>

ExterQual: Evaluates the quality of the material on the exterior

In [55]:
plot_cat_with_target('ExterQual');

Observation:
This feature is very important, since it can distinguish between distinct values for the sale price.

ExterCond: Evaluates the present condition of the material on the exterior

In [56]:
plot_cat_with_target('ExterCond', rotate=True)

Foundation: Type of foundation

In [57]:
plot_cat_with_target('Foundation')

BsmtQual: Evaluates the height of the basement

In [58]:
train.BsmtQual.fillna('NaN', inplace=True)
test.BsmtQual.fillna('NaN', inplace=True)
In [59]:
plot_cat_with_target('BsmtQual')

Observation:
As we can see, as the quality of the basement increases, which cause the price of the house to increase.

BsmtCond: Evaluates the general condition of the basement

In [60]:
train.BsmtCond.fillna('NaN', inplace=True)
test.BsmtCond.fillna('NaN', inplace=True)
In [61]:
plot_cat_with_target('BsmtCond')

Observation:
As we can see, as the condition of the basement increases, which cause the price of the house to increase, and vice versa.

BsmtExposure: Refers to walkout or garden level walls

In [62]:
train.BsmtExposure.fillna('NaN', inplace=True)
test.BsmtExposure.fillna('NaN', inplace=True)
In [63]:
plot_cat_with_target('BsmtExposure')

BsmtFinType1: Rating of basement finished area

In [64]:
train.BsmtFinType1.fillna('NaN', inplace=True)
test.BsmtFinType1.fillna('NaN', inplace=True)
In [65]:
plot_cat_with_target('BsmtFinType1')

BsmtFinSF1: Type 1 finished square feet

In [66]:
figsize()
sns.jointplot(data=train, x='BsmtFinSF1', y='SalePrice', kind='reg');
<Figure size 576x432 with 0 Axes>

BsmtFinType2: Rating of basement finished area (if multiple types)

In [67]:
train.BsmtFinType2.fillna('NaN', inplace=True)
test.BsmtFinType2.fillna('NaN', inplace=True)

plot_cat_with_target('BsmtFinType2')

BsmtFinSF2: Type 2 finished square feet

In [68]:
figsize()
sns.jointplot(data=train, x='BsmtFinSF2', y='SalePrice', kind='reg');
<Figure size 576x432 with 0 Axes>

BsmtUnfSF: Unfinished square feet of basement area

In [69]:
figsize()
sns.jointplot(data=train, x='BsmtUnfSF', y='SalePrice', kind='reg');
<Figure size 576x432 with 0 Axes>

TotalBsmtSF: Total square feet of basement area

In [70]:
figsize()
sns.jointplot(data=train, x='TotalBsmtSF', y='SalePrice', kind='reg');
<Figure size 576x432 with 0 Axes>

Heating: Type of heating

In [71]:
plot_cat_with_target('Heating')

HeatingQC: Heating quality and condition

In [72]:
plot_cat_with_target('HeatingQC')

CentralAir: Central air conditioning

In [73]:
plot_cat_with_target('CentralAir')

Observation:
As we can see this is important feature.

Electrical: Electrical system

In [74]:
plot_cat_with_target('Electrical')

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

In [75]:
figsize()
sns.jointplot(data=train, x='1stFlrSF', y='SalePrice', kind='reg');

sns.jointplot(data=train, x='2ndFlrSF', y='SalePrice', kind='reg');
<Figure size 576x432 with 0 Axes>

LowQualFinSF: Low quality finished square feet (all floors)

In [76]:
sns.jointplot(data=train, x='LowQualFinSF', y='SalePrice', kind='reg');

BsmtFullBath: Basement full bathrooms

In [77]:
plot_cat_with_target('BsmtFullBath')

Observation:
As the number of bathrooms increases, the price of the house increases as well.

BsmtHalfBath: Basement half bathrooms

In [78]:
plot_cat_with_target('BsmtHalfBath')

GrLivArea: Above grade (ground) living area square feet

In [79]:
sns.jointplot(data=train, x='GrLivArea', y='SalePrice', kind='reg');

FullBath: Full bathrooms above grade

In [80]:
plot_cat_with_target('FullBath')

HalfBath: Half baths above grade

In [81]:
plot_cat_with_target('HalfBath')

KitchenQual: Kitchen quality

In [82]:
plot_cat_with_target('KitchenQual')

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

In [83]:
plot_cat_with_target('TotRmsAbvGrd')

Functional: Home functionality (Assume typical unless deductions are warranted)

In [84]:
plot_cat_with_target('Functional')

Fireplaces: Number of fireplaces

In [85]:
plot_cat_with_target('Fireplaces')

FireplaceQu: Fireplace quality

In [86]:
train.FireplaceQu.fillna('NaN', inplace=True)
test.FireplaceQu.fillna('NaN', inplace=True)

plot_cat_with_target('FireplaceQu')

GarageType:

In [87]:
for dataset in [train, test]:
    dataset.fillna('NaN', inplace=True)
In [88]:
plot_cat_with_target('GarageType')

GarageYrBlt: Year garage was built

In [97]:
train.GarageYrBlt = train.GarageYrBlt.astype(np.float32)
sns.jointplot(data=train, x='GarageYrBlt', y='SalePrice', kind='reg');

GarageFinish: Interior finish of the garage

In [98]:
train.GarageFinish.fillna('NaN', inplace=True)
test.GarageFinish.fillna('NaN', inplace=True)
In [99]:
plot_cat_with_target('GarageFinish')

GarageCars: Size of garage in car capacity

In [100]:
plot_cat_with_target('GarageCars')

GarageArea: Size of garage in square feet

In [101]:
sns.jointplot(data=train, x='GarageArea', y='SalePrice', kind='reg')
Out[101]:
<seaborn.axisgrid.JointGrid at 0x7ff6bce506d8>

GarageQual: Garage quality

In [102]:
for dataset in [train, test]:
    dataset.fillna('NaN', inplace=True)
In [103]:
plot_cat_with_target('GarageQual')

GarageCond: Garage condition

In [104]:
for dataset in [train, test]:
    dataset.fillna('NaN', inplace=True)
In [105]:
plot_cat_with_target('GarageCond')

PavedDrive: Paved drivewa

In [106]:
plot_cat_with_target('PavedDrive')

WoodDeckSF: Wood deck area in square feet

In [107]:
sns.jointplot('WoodDeckSF', 'SalePrice', train)
Out[107]:
<seaborn.axisgrid.JointGrid at 0x7ff6bbc803c8>

OpenPorchSF: Open porch area in square feet

In [108]:
sns.jointplot('OpenPorchSF', 'SalePrice', train)
Out[108]:
<seaborn.axisgrid.JointGrid at 0x7ff6bcbac2b0>

EnclosedPorch: Enclosed porch area in square feet

In [109]:
sns.jointplot('EnclosedPorch', 'SalePrice', train);

3SsnPorch: Three season porch area in square feet

In [110]:
sns.jointplot('3SsnPorch', 'SalePrice', train);

ScreenPorch: Screen porch area in square feet

In [111]:
sns.jointplot('ScreenPorch', 'SalePrice', train);

PoolArea: Pool area in square feet

In [112]:
sns.jointplot('PoolArea', 'SalePrice', train);

PoolQC: Pool quality

In [113]:
train.PoolQC.fillna('NaN', inplace=True)
test.PoolQC.fillna('NaN', inplace=True)


plot_cat_with_target('PoolQC')

Fence: Fence quality

In [114]:
train.Fence.fillna('NaN', inplace=True)
test.Fence.fillna('NaN', inplace=True)

plot_cat_with_target('Fence')

MiscFeature: Miscellaneous feature not covered in other categories

In [115]:
train.MiscFeature.fillna('NaN', inplace=True)
test.MiscFeature.fillna('NaN', inplace=True)

plot_cat_with_target('MiscFeature')

MiscVal: $Value of miscellaneous feature

In [116]:
sns.jointplot('MiscVal', 'SalePrice', train);

MoSold: Month Sold (MM)

In [117]:
plot_cat_with_target('MoSold')

YrSold: Year Sold (YYYY)

In [118]:
plot_cat_with_target('YrSold')

SaleType: Type of sale

In [119]:
plot_cat_with_target('SaleType')

SaleCondition: Condition of sale

In [120]:
plot_cat_with_target('SaleCondition', rotate=True)

BedroomAbvGr:

In [121]:
plot_cat_with_target('BedroomAbvGr', rotate=True)

KitchenAbvGr:

In [122]:
plot_cat_with_target('KitchenAbvGr', rotate=True)

Over All Observation:
- There are some features that have dominant values, and the rest of the values are minorities.
- The quality of things matters in terms of price.

Checking:

Number of columns:

In [123]:
ord_cols  = ['LandSlope', 'OverallQual', 'OverallCond', 'ExterQual', 'ExterCond', 'BsmtExposure', 
             'BsmtFinType1', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'HeatingQC', 'Electrical', 
             'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'KitchenQual', 'TotRmsAbvGrd', 'Functional',
             'Fireplaces', 'FireplaceQu', 'GarageFinish', 'GarageCars', 'GarageQual', 'GarageCond', 'PavedDrive',
             'PoolQC', 'Fence', 'BedroomAbvGr', 'KitchenAbvGr']
cat_cols  = ['MSSubClass', 'MSZoning', 'Alley', 'LotShape', 'LandContour', 'LotConfig', 
            'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'RoofStyle', 'RoofMatl', 'Exterior1st',
            'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 'GarageType', 'MiscFeature', 'SaleType', 
             'SaleCondition', 'HouseStyle']
num_cols  = ['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
            '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
            'EnclosedPorch', '3SsnPorch', 'ScreenPorch']
date_cols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 'MoSold', 'YrSold']
to_drop   = ['Utilities', 'Street', 'PoolArea', 'MiscVal']
bin_cols  = ['CentralAir']
In [124]:
## Let's check if the number of columns in above lists are equal to the number of colums in train dataset.
print(f'~> Number of train columns:    {bg(len(train.columns)-2)}')
print(f'~> Number of observed columns: {bg(len(ord_cols+cat_cols+num_cols+date_cols+to_drop+bin_cols))}')

print(f'> They are equal: {len(ord_cols+cat_cols+num_cols+date_cols+to_drop+bin_cols) == len(train.columns)-2}')
~> Number of train columns:     78 
~> Number of observed columns:  79 
> They are equal: False
In [125]:
# Check for columns that are not included in our lists.
for col in train:
    if (col not in ord_cols) and (col not in cat_cols) and (col not in num_cols) and (col not in date_cols) and (col not in to_drop) and (col not in bin_cols):
        if col not in ['Id', 'SalePrice']:
            print(col)

print('All Columns are included!')
All Columns are included!

Chekcing Null Columns:

In [132]:
[col for col in train.columns if train[col].isnull().sum() > 0]
Out[132]:
[]
In [133]:
plt.figure(figsize=(20, 14))
sns.heatmap(train.isnull(), cbar=False);

LotFrontage:

In [138]:
train.LotFrontage = train.LotFrontage.astype(np.float)
train.LotFrontage.describe()
Out[138]:
count    1201.000000
mean       70.049958
std        24.284752
min        21.000000
25%        59.000000
50%        69.000000
75%        80.000000
max       313.000000
Name: LotFrontage, dtype: float64
In [139]:
## Number of rows has null value in `LotFronage`
train.LotFrontage.isnull().shape[0], test.LotFrontage.isnull().shape[0]
Out[139]:
(1460, 1459)
In [140]:
# Fill the missing value with the median.
train.fillna(train.median(), inplace=True)
test.fillna(test.median(), inplace=True)
In [142]:
figsize(16, 6)

plt.subplot(121)
sns.distplot(train['LotFrontage'], kde=False);
plt.title('Historgram of LotFrontage', size=20, y=1.05)

plt.subplot(122)
sns.regplot(data=train, x='LotFrontage', y='SalePrice')
plt.title('Relation b/w LotFrontage & SalePrice', size=20, y=1.05);
In [143]:
## Sanity Check
train.LotFrontage.isna().sum(), test.LotFrontage.isna().sum()
Out[143]:
(0, 0)

Electrical:

In [144]:
## Number of rows has null value in that column
train.Electrical.isna().sum(), test.Electrical.isna().sum()
Out[144]:
(0, 0)
In [145]:
train.Electrical.fillna('SBrkr', inplace=True)
In [146]:
## Sanity Check
train.Electrical.isna().sum(), test.Electrical.isna().sum()
Out[146]:
(0, 0)
In [147]:
plt.figure(figsize=(20, 14))
sns.heatmap(train.isnull(), cbar=False);

Save Resultant Files:

In [149]:
train.isnull().sum().sum(), test.isnull().sum().sum()
Out[149]:
(0, 0)
In [151]:
train['Id'] = train_ids
train['SalePrice'] = train_target
In [153]:
train.to_hdf('train_null_removed.h5', 'data', mode='w')
test.to_hdf('test_null_removed.h5', 'data', mode='w')